EC 320, Lecture 7
07 2024
Up to this point, we have been focusing on estimating OLS models.
We have mostly ignored drawing conclusions about the true population parameters from the estimates of the sample data, AKA inference.
Thus far in this class we've fit OLS models to answer questions about relationships between variables,
though we've not discussed our confidence in the fitted relationships.
Even if all 6 assumptions hold, the luck of the sampling draw might generate incorrect conclusions in a completely unbiased, coincidental fashion.
A1. Linearity: The population relationship is linear in parameters with an additive error term.
A2. Sample Variation: There is variation in \(X\).
A3. Exogeneity: The \(X\) variable is exogenous, i.e., \(\mathop{\mathbb{E}}\left[ u \mid X \right] = 0\).
A4. Homoskedasticity: The error term has the same variance for each value of the independent variable, i.e., \(\mathop{\text{Var}}(u \mid X) = \sigma^2\).
A5. Non-autocorrelation: The error terms are independently distributed.
A6. Normality: The population error term is normally distributed with mean zero and variance \(\sigma^2\).
Previously we used the first 3 assumptions to show that OLS is unbiased:
\[ \mathop{\mathbb{E}}\left[ \hat{\beta} \right] = \beta \]
We used the first 5 assumptions to derive a formula for the variance of the OLS estimator:
\[ \mathop{\text{Var}}(\hat{\beta}) = \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2} \]
Using the variance of the OLS estimator, we can gauge our confidence via the sampling distribution.
Sampling distribution
The probability distribution of the OLS estimators obtained from repeatedly drawing random samples of the same size from a population and fitting point estimates each time.
Provides information about their variability, accuracy, and precision across different samples.
Point estimates
The fitted values of the OLS estimator (e.g., \(\hat{\beta}_0, \hat{\beta}_1\))
1. Unbiasedness: If the Gauss-Markov assumptions hold, the OLS estimators are unbiased (i.e., \(E(\hat{\beta}_0) = \beta_0\) and \(E(\hat{\beta}_1) = \beta_1\))
2. Variance: The variance of the OLS estimators describes their dispersion around the true population parameters.
3. Normality: If the errors are normally distributed or the sample size is large enough, by the CLT, the sampling distribution of the OLS estimators will be approximately normal.
We use the sampling distribution of \(\hat{\beta}\) to conduct hypothesis tests.
We can use all 6 classical assumptions to show that the OLS estimator is normally distributed:
\[ \hat{\beta} \sim \mathop{N}\left( \beta, \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2} \right) \]
To “prove” this, recall our simulation from last time
Plotting the distributions of the point estimates in a histogram: first simulating 1,000 draws, then simulating 10,000 draws.
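A minimal sketch of that simulation (assuming the population values \(\beta_0 = 0\) and \(\beta_1 = 0.5\) used later in the lecture; the \(X\) distribution and error scale are illustrative):

```r
set.seed(320)
n_draws <- 1000   # try 10000 to see the histogram smooth out

beta1_hat <- replicate(n_draws, {
  x <- runif(30, 0, 10)           # a sample of size n = 30
  y <- 0 + 0.5 * x + rnorm(30)    # population model with normal errors
  coef(lm(y ~ x))[2]              # point estimate of beta1
})

hist(beta1_hat, breaks = 50,
     main = "Sampling distribution of the slope estimate")
```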
Our current workflow:
1. Get data (points with \(X\) and \(Y\) values).
2. Regress \(Y\) on \(X\).
3. Plot the point estimates (i.e., \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\)) and report.
But when do we learn something? We are missing a step.
We need to account for the possibility that our sample is atypical, i.e., quantify uncertainty.
However, there is a problem.
Recall the variance of the point estimate \(\hat{\beta}_1\) \[ \mathop{\text{Var}}(\hat{\beta}_1) = \frac{{\color{#BF616A} \sigma^2}}{\sum_{i=1}^n (X_i - \bar{X})^2} \]
The problem is that \({\color{#BF616A} \sigma^2}\) is unobserved. So what do we do? Estimate it.
We can estimate the variance of \(u_i\) (\({\color{#BF616A} \sigma^2}\)) using the sum of squared residuals (RSS):
\[ s^2_u = \dfrac{\sum_i \hat{u}_i^2}{n - k} \]
where \(n\) is the number of observations and \(k\) is the number of regression parameters. (In a simple linear regression, \(k=2\).)
If the assumptions from Gauss-Markov hold, then \(s^2_u\) is an unbiased estimator of \(\sigma^2\).
In essence, we are learning from our prediction errors
With \(s^2_u = \dfrac{\sum_i \hat{u}_i^2}{n - k}\), we can calculate the estimated variance of \(\hat{\beta}_1\)
\[ \mathop{\hat{\text{Var}}}(\hat{\beta}_1) = \frac{s^2_u}{\sum_{i=1}^n (X_i - \bar{X})^2} \]
Taking the square root, we get the standard error of the OLS estimator:
\[ \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) = \sqrt{ \frac{s^2_u}{\sum_{i=1}^n (X_i - \bar{X})^2} } \]
The standard error is the standard deviation of the sampling distribution.
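These formulas are easy to verify by hand. Below is a minimal sketch on simulated data (the data-generating process and variable names are illustrative); the manual standard error should match the one reported by summary():

```r
set.seed(320)
x   <- runif(100, 0, 10)
y   <- 1 + 0.5 * x + rnorm(100)   # illustrative population model
fit <- lm(y ~ x)

# s^2_u = RSS / (n - k), with k = 2 parameters in simple linear regression
s2_u <- sum(resid(fit)^2) / (length(y) - 2)

# SE(beta1-hat) = sqrt( s^2_u / sum((X_i - Xbar)^2) )
se_beta1 <- sqrt(s2_u / sum((x - mean(x))^2))

se_beta1                                       # manual calculation
summary(fit)$coefficients["x", "Std. Error"]   # lm's reported standard error
```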
After deriving the distribution of \(\hat{\beta}_1\), we have two (related) options for formal statistical inference (learning) about our unknown parameter \(\beta_1\): hypothesis tests and confidence intervals.
Hypothesis testing is a systematic procedure that gives us evidence to hang our hat on. We start with a null hypothesis (\(H_0\)) and an alternative hypothesis (\(H_1\)):
\[ \begin{align*} H_0:& \beta_1 = 0 \\ H_1:& \beta_1 \neq 0 \end{align*} \]
In the context of the wage regression:
\[ \text{Wage}_i = \beta_0 + \beta_1 \cdot \text{Education}_i + u_i \]
\(H_0\): Education has no effect on wage
\(H_1\): Education has an effect on wage
Within this structure, four possible outcomes exist:
1. We fail to reject the null hypothesis and the null is true. Ex. Education has no effect on wage and, correctly, we fail to reject \(H_0\).
2. We reject the null hypothesis and the null is false. Ex. Education has an effect on wage and, correctly, we reject \(H_0\).
3. We reject the null hypothesis, but the null is actually true. Ex. Education has no effect on wage, but we incorrectly reject \(H_0\). This is an error, defined as a Type I error (a false positive).
4. We fail to reject the null hypothesis, but the null is actually false. Ex. Education has an effect on wage, but we incorrectly fail to reject \(H_0\). This is an error, defined as a Type II error (a false negative).
Or… from the golden age of textbook illustrations
Goal: Make a statement about \(\beta_1\) using information on \(\hat{\beta}_1\).
\(\hat{\beta}_1\) is random—it could be anything, even if \(\beta_1 = 0\) is true.
Hypothesis testing takes extreme values of \(\hat{\beta}_1\) as evidence against the null hypothesis, but it weighs them against the estimated variance of \(\hat{\beta}_1\).
\(H_0\): \(\beta_1 = 0\)
\(H_1\): \(\beta_1 \neq 0\)
To conduct the test, we calculate a \(t\)-statistic:
\[ t = \frac{\hat{\beta}_1 - \beta_1^0}{\mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right)} \]
where \(\beta_1^0\) is the null value (here, 0). The \(t\)-statistic follows a \(t\)-distribution with \(n-2\) degrees of freedom.
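As a concrete sketch (reusing the simulated `fit` from the sketch above; the null value is \(\beta_1^0 = 0\)):

```r
# t-statistic for H0: beta1 = 0
b1  <- coef(summary(fit))["x", "Estimate"]
se1 <- coef(summary(fit))["x", "Std. Error"]
t_stat <- (b1 - 0) / se1   # subtract the null value beta1^0 = 0
t_stat
```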
Normal distribution vs. \(t\)-distribution
Two-sided \(t\) tests
To conduct a \(t\) test, compare the \(t\)-statistic to the appropriate critical value of the \(t\)-distribution.
Reject \(H_0\) at the \(\alpha \cdot 100\)-percent level if
\[ \left| t \right| = \left| \dfrac{\hat{\beta}_1 - \beta_1^0}{\mathop{\hat{\text{SE}}}(\hat{\beta}_1)} \right| > t_\text{crit}. \]
Next, we use the \(\color{#434C5E}{t}\)-statistic to calculate a \(\color{#B48EAD}{p}\)-value.
Describes the probability of seeing a \(\color{#434C5E}{t}\)-statistic as extreme as the one we observe if the null hypothesis is actually true.
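Continuing the sketch above (with `fit` and `t_stat` as defined there), the \(p\)-value comes from the \(t\) CDF:

```r
# two-sided p-value: probability, under H0, of a t-statistic
# at least as extreme as the one observed
p_val <- 2 * pt(-abs(t_stat), df = df.residual(fit))
p_val
```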
But…we still need some benchmark to compare our \(\color{#B48EAD}{p}\)-value against.
We worry mostly about false positives, so we conduct hypothesis tests based on the probability of making a Type I error.
How? We select a significance level, \(\color{#434C5E}{\alpha}\), that specifies our tolerance for false positives (i.e., the probability of Type I error we choose to live with).
To visualize Type I and Type II errors, we can plot the sampling distributions of \(\hat{\beta}_1\) under the null and alternative hypotheses.
Type I vs Type II
We then compare \(\color{#434C5E}{\alpha}\) to the \(\color{#B48EAD}{p}\)-value of our test.
If the \(\color{#B48EAD}{p}\)-value is less than \(\color{#434C5E}{\alpha}\), then we reject the null hypothesis at the \(\color{#434C5E}{\alpha}\cdot100\) percent level.
If the \(\color{#B48EAD}{p}\)-value is greater than \(\color{#434C5E}{\alpha}\), then we fail to reject the null hypothesis at the \(\color{#434C5E}{\alpha}\cdot100\) percent level.
Ex. Are campus police associated with campus crime? (\(\alpha = 0.05\))
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    18.4       2.38      7.75 1.06e-11
2 police          1.76      1.30      1.35 1.81e- 1
\(H_0\): \(\beta_\text{Police} = 0\)
\(H_1\): \(\beta_\text{Police} \neq 0\)
Significance level: \(\color{#434C5E}{\alpha} = 0.05\) (i.e., 5 percent test)
Test Condition: Reject \(H_0\) if \(p < \alpha\)
What is the \(\color{#B48EAD}{p}\)-value? \(p = 0.18\)
Do we reject the null hypothesis? No.
\(\color{#B48EAD}{p}\)-values are difficult to calculate by hand.
Alternative: Compare \(\color{#434C5E}{t}\)-statistic to critical values from the \({\color{#434C5E} t}\)-distribution.
Notation: \(t_{1-\alpha/2, n-2}\) or \(t_\text{crit}\).
Compare the critical value to your \(t\)-statistic:
Based on a critical value of \(t_{1-\alpha/2, n-2} = t_{0.975, 100} =\) 1.98, we can identify a rejection region on the \(\color{#434C5E}{t}\)-distribution.
If our \(\color{#434C5E}{t}\)-statistic is in the rejection region, then we reject the null hypothesis at the 5 percent level.
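The critical-value check is equally direct in R (continuing the sketch above):

```r
# two-sided critical value at alpha = 0.05
t_crit <- qt(1 - 0.05 / 2, df = df.residual(fit))
abs(t_stat) > t_crit   # TRUE -> reject H0 at the 5 percent level
```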
Ex. \(\alpha = 0.05\)
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    2.53     0.422       6.00 3.38e- 8
2 x              0.567    0.0793      7.15 1.59e-10
\(H_0\): \(\beta_1 = 0\)
\(H_1\): \(\beta_1 \neq 0\)
Notice that the \(\color{#434C5E}{t}\)-statistic is 7.15. The critical value is \(\color{#434C5E}{t_{\text{0.975, 28}}} = 2.05\).
Since \(|7.15| > 2.05\), the \(t\)-statistic lies in the rejection region, which implies that \(p < 0.05\). Therefore, we reject \(H_0\) at the 5% level.
Ex. Are campus police associated with campus crime? (\(\alpha = 0.1\))
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    18.4       2.38      7.75 1.06e-11
2 police          1.76      1.30      1.35 1.81e- 1
\(H_0\): \(\beta_\text{Police} = 0\)
\(H_1\): \(\beta_\text{Police} \neq 0\)
The \(\color{#434C5E}{t \text{-stat}} = 1.35\). The critical value is \(\color{#434C5E}{t_{\text{0.95, 94}}} = 1.66\).
\(\left| \color{#434C5E}{t \text{-stat}} \right| < \color{#434C5E}{t_{\text{crit}}}\) implies that \(p > 0.1\). Therefore, we fail to reject \(H_0\) at the 10% level.
We might be confident in a parameter being non-negative/non-positive.
One-sided tests assume that the parameter of interest is either greater than or less than the null value.
Option 1 \(H_0\): \(\beta_1 = 0\) vs. \(H_1\): \(\beta_1 > 0\)
Option 2 \(H_0\): \(\beta_1 = 0\) vs. \(H_1\): \(\beta_1 < 0\)
If this assumption is reasonable, then our rejection region changes.
Left-tailed: Based on a critical value of \(-t_{1-\alpha, n-2} = -t_{0.95, 100} = -1.66\), the rejection region is the left tail of the \(t\)-distribution.
If our \(t\)-statistic is in the rejection region (\(t < -1.66\)), then we reject the null hypothesis at the 5 percent level.
Right-tailed: Based on a critical value of \(t_{1-\alpha, n-2} = t_{0.95, 100} = 1.66\), the rejection region is the right tail of the \(t\)-distribution.
If our \(t\)-statistic is in the rejection region (\(t > 1.66\)), then we reject the null hypothesis at the 5 percent level.
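In R, the one-sided critical values come from the \(1-\alpha\) (or \(\alpha\)) quantile of the \(t\)-distribution (a sketch with \(\alpha = 0.05\) and 100 degrees of freedom, matching the example above):

```r
qt(0.95, df = 100)   # right-tailed critical value, about  1.66
qt(0.05, df = 100)   # left-tailed critical value,  about -1.66
```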
Ex. Do campus police deter campus crime? (\(\alpha = 0.1\))
Suppose we rule out the possibility that police increase crime, but not that they have no effect.
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    18.4       2.38      7.75 1.06e-11
2 police          1.76      1.30      1.35 1.81e- 1
\(H_0\): \(\beta_\text{Police} = 0\)
\(H_1\): \(\beta_\text{Police} < 0\)
Notice that the \(\color{#434C5E}{t \text{-stat}} = 1.35\), while the left-tailed rejection region is \(t < -\color{#434C5E}{t_{\text{0.9, 94}}} = -1.29\).
The \(t\)-statistic is not in the rejection region, so \(p > 0.1\). Therefore, we fail to reject \(H_0\) at the 10% level.
Until now, we have considered point estimates of population parameters.
We can construct \((1-\alpha)\cdot100\)-percent level confidence intervals for \(\beta_1\)
\[ \hat{\beta}_1 \pm t_{1-\alpha/2, n-2} \, \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) \]
\(t_{1-\alpha/2,n-2}\) denotes the \(1-\alpha/2\) quantile of a \(t\) distribution with \(n-2\) degrees of freedom.
Q: Where does the confidence interval formula come from?
A: Formula is a result from the rejection condition of a two-sided test.
Reject \(H_0\) if
\[ |t| > t_\text{crit} \]
The test condition implies that we:
Fail to reject \(H_0\) if
\[ |t| \leq t_\text{crit} \]
or, \[ -t_\text{crit} \leq t \leq t_\text{crit} \]
Replacing \(t\) with its formula gives:
Fail to reject \(H_0\) if
\[-t_\text{crit} \leq \frac{\hat{\beta}_1 - \beta_1^0}{\mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right)} \leq t_\text{crit} \]
Standard errors are always positive, so the inequalities do not flip when we multiply by \(\mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right)\):
Fail to reject \(H_0\) if \[ -t_\text{crit} \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) \leq \hat{\beta}_1 - \beta_1^0\leq t_\text{crit} \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) \]
Subtracting \(\hat{\beta}_1\) yields
Fail to reject \(H_0\) if \[ -\hat{\beta}_1 -t_\text{crit} \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) \leq - \beta_1^0 \leq - \hat{\beta}_1 + t_\text{crit} \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) \]
Multiplying by \(-1\) (which flips the inequalities) and rearranging gives
Fail to reject \(H_0\) if
\[ \hat{\beta}_1 - t_\text{crit} \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) \leq \beta_1^0 \leq \hat{\beta}_1 + t_\text{crit} \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) \]
Replacing \(\beta_1^0\) with \(\beta_1\) and dropping the test condition yields the interval:
\[ \hat{\beta}_1 - t_\text{crit} \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) \leq \beta_1 \leq \hat{\beta}_1 + t_\text{crit} \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) \]
which is equivalent to
\[ \hat{\beta}_1 \pm t_\text{crit} \, \mathop{\hat{\text{SE}}} \left( \hat{\beta}_1 \right) \]
Main insight:
Generally, a \((1- \alpha) \cdot 100\)-percent confidence interval embeds a two-sided test at the \(\alpha \cdot 100\)-percent level.
# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    2.53     0.422       6.00 3.38e- 8
2 x              0.567    0.0793      7.15 1.59e-10
# find degrees of freedom
dof <- summary(lm(y ~ x, data = pop_df))$df[2]
# return critical value
qt(0.975, dof)
#> [1] 1.984467
95% confidence interval for \(\beta_1\) is:
\[ 0.567 \pm 1.98 \times 0.0793 = \left[ 0.410,\, 0.724 \right] \]
We have a confidence interval for \(\beta_1\), i.e., \(\left[ 0.410,\, 0.724 \right]\)
What does it mean?
Informally: The confidence interval gives us a region (interval) in which we can place some trust (confidence) for containing the parameter.
More formally: If we repeatedly sample from our population and construct confidence intervals for each of these samples, then \((1-\alpha) \cdot100\) percent of our intervals (e.g., 95%) will contain the population parameter somewhere in the interval.
Going back to our simulation…
We drew 10,000 samples (each of size \(n = 30\)) from our population and estimated our regression model for each sample:
\[ Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i \]
The true parameter values are \(\beta_0 = 0\) and \(\beta_1 = 0.5\)
Let’s estimate 95% confidence intervals for each of these samples…
From our previous simulation, 97.9% of 95% confidence intervals contain the true parameter value of \(\beta_1\).
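A compact version of this coverage check (a sketch assuming the population values \(\beta_0 = 0\), \(\beta_1 = 0.5\) and an illustrative data-generating process):

```r
set.seed(320)
covered <- replicate(10000, {
  x  <- runif(30, 0, 10)              # n = 30, as in the simulation
  y  <- 0 + 0.5 * x + rnorm(30)       # true beta0 = 0, beta1 = 0.5
  ci <- confint(lm(y ~ x), "x", level = 0.95)
  ci[1] <= 0.5 & 0.5 <= ci[2]         # does the CI contain the true beta1?
})
mean(covered)                         # share of intervals covering beta1
```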
You can instruct tidy to return a 95 percent confidence interval for the association of campus police with campus crime:
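With broom, this is a single argument to tidy() (a sketch; `campus_df` and the variable names are placeholders for the campus-crime data):

```r
library(broom)
# conf.int = TRUE appends conf.low and conf.high columns
tidy(lm(crime ~ police, data = campus_df),
     conf.int = TRUE, conf.level = 0.95)
```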
Four confidence intervals for the same coefficient.
Hypothesis testing is an essential tool. Yet the traditional way of teaching hypothesis testing can be unintuitive.
It took me several tries (classes) to fully understand the concept
If you can program, you have direct access to the fundamental ideas in statistics
To demonstrate, consider hypothesis testing
In order to do that, we need a problem…
Does drinking beer make you more attractive to mosquitoes?
Though it sounds silly, this research question is important
Here is the data. Treatment group in blue.
Treatment mean: 23.6 Control mean: 19.22
Difference in means: 4.38
Plot the true difference
Suppose the difference is coincidental. Then the labels don’t matter
Treatment mean: 21.64 Control mean: 21.94
False difference in means: -0.3
Plot the “fake” difference
Labels don’t matter. Assign treatment randomly. Find the difference.
Treatment mean: 21.96 Control mean: 21.5
False difference in means: 0.46
Plot the difference
Labels don’t matter. Assign treatment randomly. Find the difference.
Treatment mean: 21.88 Control mean: 21.61
False difference in means: 0.27
Plot the differences
Labels don’t matter. Assign treatment randomly. Find the difference.
Treatment mean: 21.76 Control mean: 21.78
False difference in means: -0.02
Plot the differences
Plot all 2,500 differences.
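The entire procedure fits in a few lines of R (a sketch assuming a data frame `mosquito_df` with a `count` column and a `group` column equal to "beer" or "water"; these names are illustrative):

```r
# observed difference in means: treatment (beer) minus control (water)
obs_diff <- with(mosquito_df,
                 mean(count[group == "beer"]) - mean(count[group == "water"]))

# if the labels don't matter, shuffling them shouldn't change much:
# reassign treatment at random and recompute the difference, many times
set.seed(320)
perm_diffs <- replicate(2500, {
  shuffled <- sample(mosquito_df$group)
  mean(mosquito_df$count[shuffled == "beer"]) -
    mean(mosquito_df$count[shuffled == "water"])
})

# p-value: share of shuffled differences at least as extreme as the observed one
mean(abs(perm_diffs) >= abs(obs_diff))
```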